Data Cleaning
Canonical correlation regression with noisy data
We study instrumental variable regression in data-rich environments. The goal is to estimate a linear model from many noisy covariates and many noisy instruments. Our key assumption is that the true covariates and true instruments are repetitive, though possibly different in nature: each reflects a few underlying factors, and those underlying factors may be misaligned. We analyze a family of estimators based on two-stage least squares with spectral regularization: in the first stage, canonical correlations between covariates and instruments are learned; in the second stage, they serve as regressors. As a theoretical contribution, we derive upper and lower bounds on estimation error, proving optimality of the method with noisy data. As a practical contribution, we provide guidance on which types of spectral regularization to use in different regimes.
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Data Science > Data Quality > Data Cleaning (0.61)
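A minimal sketch of the two-stage procedure the abstract describes, assuming centered data matrices `X` (noisy covariates), `Z` (noisy instruments), and outcome `y`; the hard rank cutoff `k` stands in for one member of the family of spectral regularizers the paper analyzes, and all names are ours, not the paper's.

```python
import numpy as np

def cca_2sls(X, Z, y, k):
    """Two-stage least squares with rank-k spectral regularization via CCA.

    A minimal sketch, not the paper's exact estimator: whiten X and Z,
    take the top-k canonical directions, and use the first-stage fitted
    canonical covariates as regressors in the second stage.
    """
    n = X.shape[0]
    X = X - X.mean(0)
    Z = Z - Z.mean(0)
    y = y - y.mean()

    # Inverse square roots of the empirical covariances (whitening maps).
    def inv_sqrt(C, eps=1e-8):
        w, V = np.linalg.eigh(C)
        w = np.maximum(w, eps)
        return V @ np.diag(w ** -0.5) @ V.T

    Wx = inv_sqrt(X.T @ X / n)
    Wz = inv_sqrt(Z.T @ Z / n)

    # SVD of the whitened cross-covariance yields canonical correlations s.
    U, s, Vt = np.linalg.svd(Wz @ (Z.T @ X / n) @ Wx)
    A = Wz @ U[:, :k]        # canonical directions for the instruments
    B = Wx @ Vt[:k].T        # canonical directions for the covariates

    # Stage 1: fitted canonical covariates = instrument variates scaled by s.
    X_hat = (Z @ A) * s[:k]  # shape (n, k)

    # Stage 2: OLS of y on the fitted canonical covariates.
    gamma = np.linalg.lstsq(X_hat, y, rcond=None)[0]
    return B @ gamma         # coefficient mapped back to covariate space
```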
Error Correction Code Transformer
Error correction code is a major part of the physical communication layer, ensuring the reliable transfer of data over noisy channels. Recently, neural decoders were shown to outperform classical decoding techniques. However, the existing neural approaches present strong overfitting, due to the exponential training complexity, or a restrictive inductive bias, due to reliance on Belief Propagation. Recently, Transformers have become methods of choice in many applications, thanks to their ability to represent complex interactions between elements. In this work, we propose to extend for the first time the Transformer architecture to the soft decoding of linear codes at arbitrary block lengths. We encode each channel's output dimension to a high dimension for a better representation of the bits' information to be processed separately. The element-wise processing allows the analysis of channel output reliability, while the algebraic code and the interaction between the bits are inserted into the model via an adapted masked self-attention module. The proposed approach demonstrates the power and flexibility of Transformers and outperforms existing state-of-the-art neural decoders by large margins, at a fraction of their time complexity.
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Data Science > Data Quality > Data Cleaning (0.67)
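The two architectural ideas in the abstract lend themselves to a compact illustration. The toy PyTorch layer below is our simplification, not the authors' architecture: each scalar channel output is embedded into a high-dimensional vector, and self-attention between bit positions is masked according to whether they co-occur in a parity check of `H`.

```python
import torch
import torch.nn as nn

class ToyECCDecoderLayer(nn.Module):
    """Hypothetical simplification of a code-aware Transformer layer:
    (1) each scalar channel output is embedded to d_model dimensions;
    (2) attention is masked by code structure from parity-check matrix H."""

    def __init__(self, H, d_model=64, n_heads=4):
        super().__init__()
        c, n = H.shape
        self.embed = nn.Linear(1, d_model)   # per-bit scalar -> d_model vector
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        conn = (H.T.float() @ H.float()) > 0  # bits sharing at least one check
        mask = torch.zeros(n, n).masked_fill(~conn, float("-inf"))
        self.register_buffer("mask", mask)    # added to attention logits

    def forward(self, llr):                   # llr: (batch, n) channel outputs
        x = self.embed(llr.unsqueeze(-1))     # (batch, n, d_model)
        out, _ = self.attn(x, x, x, attn_mask=self.mask)
        return out
```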
Understanding the Gain from Data Filtering in Multimodal Contrastive Learning
Pareek, Divyansh, Oh, Sewoong, Du, Simon S.
The success of modern multimodal representation learning relies on internet-scale datasets. Due to the low quality of a large fraction of raw web data, data curation has become a critical step in the training pipeline. Filtering using a trained model (i.e., teacher-based filtering) has emerged as a successful solution, leveraging a pre-trained model to compute quality scores. To explain the empirical success of teacher-based filtering, we characterize the performance of filtered contrastive learning under the standard bimodal data generation model. Denoting $\eta\in(0,1]$ as the fraction of data with correctly matched modalities among $n$ paired samples, we utilize a linear contrastive learning setup to show a provable benefit of data filtering: $(i)$ the error without filtering is upper and lower bounded by $\frac{1}{\eta\sqrt{n}}$, and $(ii)$ the error with teacher-based filtering is upper bounded by $\frac{1}{\sqrt{\eta n}}$ in the large $\eta$ regime, and by $\frac{1}{\sqrt{n}}$ in the small $\eta$ regime.
- North America > United States > Washington > King County > Seattle (0.14)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.45)
- Information Technology > Data Science > Data Quality > Data Cleaning (0.34)
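For concreteness, here is a hedged sketch of teacher-based filtering as described above, assuming paired teacher embeddings are already computed; the scoring rule (per-pair cosine similarity) and the quantile threshold are our illustrative choices, not the paper's protocol.

```python
import numpy as np

def teacher_filter(img_emb, txt_emb, keep_frac=0.5):
    """Keep the pairs the teacher scores highest.

    img_emb, txt_emb: (n, d) teacher embeddings of the paired modalities.
    Returns a boolean mask over the n pairs.
    """
    # Per-pair cosine similarity under the teacher serves as the quality score.
    a = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    b = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    scores = (a * b).sum(axis=1)
    # Retain the top keep_frac fraction of pairs.
    thresh = np.quantile(scores, 1.0 - keep_frac)
    return scores >= thresh
```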
OpenGVL -- Benchmarking Visual Temporal Progress for Data Curation
Budzianowski, Paweł, Wiśnios, Emilia, Góral, Gracjan, Kulakov, Igor, Petrenko, Viktor, Walas, Krzysztof
Data scarcity remains one of the most limiting factors in driving progress in robotics. However, the amount of available robotics data in the wild is growing exponentially, creating new opportunities for large-scale data utilization. Reliable temporal task completion prediction could help automatically annotate and curate this data at scale. The Generative Value Learning (GVL) approach was recently proposed, leveraging the knowledge embedded in vision-language models (VLMs) to predict task progress from visual observations. Building upon GVL, we propose OpenGVL, a comprehensive benchmark for estimating task progress across diverse challenging manipulation tasks involving both robotic and human embodiments. We evaluate the capabilities of publicly available open-source foundation models, showing that open-source model families significantly underperform closed-source counterparts, achieving only approximately $70\%$ of their performance on temporal progress prediction tasks. Furthermore, we demonstrate how OpenGVL can serve as a practical tool for automated data curation and filtering, enabling efficient quality assessment of large-scale robotics datasets. We release the benchmark along with the complete codebase at \href{github.com/budzianowski/opengvl}{OpenGVL}.
- Europe > Poland > Masovia Province > Warsaw (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Data Science > Data Quality > Data Cleaning (0.61)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
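The curation use case suggests a simple per-episode signal. The sketch below is our hypothetical reading of the GVL idea, not code from the OpenGVL repository: an episode is scored by the rank correlation between a VLM's per-frame progress predictions and the true frame order, and low-scoring episodes are candidates for filtering.

```python
import numpy as np
from scipy.stats import spearmanr

def value_order_correlation(progress):
    """Score one episode by how monotonically predicted task progress
    rises with time; `progress` is a 1-D array of per-frame VLM predictions."""
    t = np.arange(len(progress))
    rho, _ = spearmanr(progress, t)
    return rho  # near 1.0: coherent episode; low or negative: flag for curation
```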
Consistency Flow Model Achieves One-step Denoising Error Correction Codes
Lei, Haoyu, Lau, Chin Wa, Zhou, Kaiwen, Guo, Nian, Farnia, Farzan
Error Correction Codes (ECC) are fundamental to reliable digital communication, yet designing neural decoders that are both accurate and computationally efficient remains challenging. Recent denoising diffusion decoders with transformer backbones achieve state-of-the-art performance, but their iterative sampling limits practicality in low-latency settings. We introduce the Error Correction Consistency Flow Model (ECCFM), an architecture-agnostic training framework for high-fidelity one-step decoding. By casting the reverse denoising process as a Probability Flow Ordinary Differential Equation (PF-ODE) and enforcing smoothness through a differential time regularization, ECCFM learns to map noisy signals along the decoding trajectory directly to the original codeword in a single inference step. Across multiple decoding benchmarks, ECCFM attains lower bit-error rates (BER) than autoregressive and diffusion-based baselines, with notable improvements on longer codes, while delivering inference speeds 30x to 100x faster than denoising diffusion decoders.
- Asia > China > Hong Kong (0.04)
- North America > United States > California (0.04)
- Asia > China > Ningxia Hui Autonomous Region > Yinchuan (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Data Science > Data Quality > Data Cleaning (0.92)
- (2 more...)
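A minimal sketch of consistency-style training in the spirit of the abstract, not the paper's exact objective: the network `model(x_t, t)` (signature assumed) is trained so its output agrees across adjacent times on the same noising trajectory and stays anchored to the clean codeword, which is what permits one-step decoding at inference.

```python
import torch
import torch.nn.functional as F

def consistency_step(model, x0, opt, dt=0.01):
    """One hypothetical training step; x0: (batch, n) clean codewords."""
    b = x0.shape[0]
    # Sample times in [dt, 1 - dt] so t - dt stays valid.
    t = dt + (1 - 2 * dt) * torch.rand(b, device=x0.device)
    eps = torch.randn_like(x0)
    # Rectified-flow-style linear interpolation between codeword and noise.
    x_t  = (1 - t)[:, None] * x0 + t[:, None] * eps
    x_t2 = (1 - t + dt)[:, None] * x0 + (t - dt)[:, None] * eps
    pred = model(x_t, t)
    with torch.no_grad():
        target = model(x_t2, t - dt)   # earlier point on the same trajectory
    # Consistency across adjacent times plus anchoring to the clean codeword.
    loss = F.mse_loss(pred, target) + F.mse_loss(pred, x0)
    opt.zero_grad(); loss.backward(); opt.step()
    # One-step decoding at inference: x0_hat = model(x_noisy, torch.ones(b)).
    return loss.item()
```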
ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features
Lin, Ye Bhone, Aung, Thura, Thu, Ye Kyaw, Oo, Thazin Myint
This paper investigates sequence-to-sequence Transformer models for automatic speech recognition (ASR) error correction in low-resource Burmese, focusing on different feature integration strategies including IPA and alignment information. To our knowledge, this is the first study addressing ASR error correction specifically for Burmese. We evaluate five ASR backbones and show that our ASR Error Correction (AEC) approaches consistently improve word- and character-level accuracy over baseline outputs. The proposed AEC model, combining IPA and alignment features, reduced the average WER of ASR models from 51.56 to 39.82 before augmentation (and from 51.56 to 43.59 after augmentation) and improved chrF++ scores from 0.5864 to 0.627, demonstrating consistent gains over the baseline ASR outputs without AEC. Our results highlight the robustness of AEC and the importance of feature design for improving ASR outputs in low-resource settings.
- Asia > Thailand > Bangkok > Bangkok (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- (3 more...)
- Information Technology > Data Science > Data Quality > Data Cleaning (1.00)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
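The abstract does not specify how IPA and alignment features enter the model; the sketch below shows one hypothetical input-construction scheme (the format, token separator, and operation tags are our assumptions) in which each ASR hypothesis token is interleaved with its phonetic transcription and an alignment edit operation before being fed to the seq2seq corrector.

```python
def build_aec_input(asr_tokens, ipa, align_ops):
    """Hypothetical AEC input format: token|IPA|alignment-op triples.

    asr_tokens: ASR hypothesis tokens; ipa: per-token IPA strings;
    align_ops: per-token edit operations (e.g. "KEEP", "SUB", "DEL").
    """
    parts = [f"{tok}|{ph}|{op}" for tok, ph, op in zip(asr_tokens, ipa, align_ops)]
    return " ".join(parts)  # serialized source sequence for the Transformer
```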
Continual Error Correction on Low-Resource Devices
Paramonov, Kirill, Ozay, Mete, Mystakidis, Aristeidis, Tsalikidis, Nikolaos, Sotos, Dimitrios, Drosou, Anastasios, Tzovaras, Dimitrios, Kim, Hyunjun, Chang, Kiseok, Mo, Sangdok, Kim, Namwoong, Yoo, Woojong, Moon, Jijoong, Michieli, Umberto
The proliferation of AI models in everyday devices has highlighted a critical challenge: prediction errors that degrade user experience. While existing solutions focus on error detection, they rarely provide efficient correction mechanisms, especially for resource-constrained devices. We present a novel system enabling users to correct AI misclassifications through few-shot learning, requiring minimal computational resources and storage. Our approach combines server-side foundation model training with on-device prototype-based classification, enabling efficient error correction through prototype updates rather than model retraining. The system consists of two key components: (1) a server-side pipeline that leverages knowledge distillation to transfer robust feature representations from foundation models to device-compatible architectures, and (2) a device-side mechanism that enables ultra-efficient error correction through prototype adaptation. We demonstrate our system's effectiveness on both image classification and object detection tasks, achieving over 50% error correction in one-shot scenarios on Food-101 and Flowers-102 datasets while maintaining minimal forgetting (less than 0.02%) and negligible computational overhead. Our implementation, validated through an Android demonstration app, proves the system's practicality in real-world scenarios.
- Europe > Switzerland > Zürich > Zürich (0.14)
- Africa > South Africa (0.05)
- Asia > South Korea (0.05)
- (5 more...)
- Information Technology > Data Science > Data Quality > Data Cleaning (1.00)
- Information Technology > Communications > Mobile (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- (2 more...)
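The prototype-update mechanism is simple enough to sketch directly. Below is our simplification of the on-device side of the system described above: class prototypes are mean feature vectors, prediction is nearest-prototype by cosine similarity, and a user correction folds the misclassified feature into the true class's prototype with no gradient steps or retraining.

```python
import numpy as np

class PrototypeCorrector:
    """Sketch of on-device prototype classification with few-shot correction."""

    def __init__(self, prototypes):
        # prototypes: dict mapping class name -> initial feature vector.
        self.proto = {c: p.copy() for c, p in prototypes.items()}
        self.count = {c: 1 for c in prototypes}

    def predict(self, feat):
        # Nearest prototype under cosine similarity.
        sims = {c: feat @ p / (np.linalg.norm(feat) * np.linalg.norm(p))
                for c, p in self.proto.items()}
        return max(sims, key=sims.get)

    def correct(self, feat, true_class):
        # Error correction as a running-average prototype update.
        n = self.count[true_class]
        self.proto[true_class] = (self.proto[true_class] * n + feat) / (n + 1)
        self.count[true_class] = n + 1
```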
Step-E: A Differentiable Data Cleaning Framework for Robust Learning with Noisy Labels
Modern deep networks achieve impressive performance when trained on large, clean, and carefully curated datasets. In realistic data mining scenarios, however, labels come from heterogeneous sources such as crowdsourcing, weak supervision, or heuristic rules and are therefore noisy [18, 3]. Human annotation errors, ambiguous images, and domain shifts all contribute to mislabeled or outlier samples that can harm generalization. In image classification, for example, web-scale datasets often contain wrong tags or near-duplicate images with conflicting labels; in user-generated content analysis, spam or off-topic posts corrupt the training distribution. Data cleaning is widely recognized as crucial [15] but is typically performed before model training using hand-crafted rules or separate anomaly detectors [9, 16]. This two-stage design has two drawbacks: (i) it requires domain expertise or extra supervision to specify cleaning rules and thresholds; (ii) it decouples cleaning from model optimization, so the decisions do not directly leverage discriminative feedback from the task model. Some high-loss samples may still be informative "hard cases," whereas others are truly corrupted and should be discarded. We explore a different paradigm: can the model learn which samples to trust during training, treating data cleaning as an integral, differentiable part of optimization?
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Data Science > Data Quality > Data Cleaning (0.93)
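One way to make cleaning differentiable, sketched below as a minimal reading of the paragraph above (the actual Step-E objective may differ): each training sample carries a learnable inclusion score, squashed through a sigmoid to weight its loss, with a penalty that discourages the trivial solution of discarding everything. The weights are optimized jointly with the model parameters.

```python
import torch
import torch.nn.functional as F

def step_e_like_loss(model, x, y, sample_logits, lam=1e-3):
    """Weighted loss with learnable per-sample trust.

    sample_logits: a learnable (batch,) tensor of inclusion scores,
    optimized alongside model parameters. lam trades cleaning aggressiveness
    against coverage (our illustrative regularizer).
    """
    w = torch.sigmoid(sample_logits)                       # trust in (0, 1)
    per_sample = F.cross_entropy(model(x), y, reduction="none")
    # Trusted samples dominate the loss; the penalty keeps weights from
    # collapsing to zero, so only genuinely harmful samples are down-weighted.
    return (w * per_sample).mean() + lam * (1 - w).mean()
```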